Global Climate Change

A Data Science Tutorial by Jamie Lebovics

According to NASA$_1$, at least 97% of actively publishing climate scientists agree that global climate change is real and that it is caused by human activity. Yet it is still treated as a controversial topic: many people are misinformed, and people in power are not changing policies enough to alleviate the problem facing our planet and, by extension, its entire population.

This tutorial shows how to take existing data about the global and United States climate and visualize it in a way that meaningfully conveys that climate change is real. Analyses like this one remain important for educating those who may have been misinformed because the other 3% of scientists have been given a disproportionate voice, leading people to doubt the need for policy changes aimed at stopping climate change.

Further, I'll explore how different parts of the world may be affected differently.

1 - source and further reading: https://climate.nasa.gov/scientific-consensus/

Data Collection

I downloaded climate data from Kaggle$_2$, which is behind a login and therefore could not be included directly in this tutorial, and loaded the data into a pandas DataFrame. Then, since I also need to know each country's continent and region, I accessed a raw table with this information that is hosted publicly on GitHub$_3$.

2 - https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

3 - https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv

In [3]:
import requests

url = "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
r = requests.get(url)

# display just the first 1000 characters to get an idea of what the page looks like
r.text[:1000]
Out[3]:
'name,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,region-code,sub-region-code\nAfghanistan,AF,AFG,004,ISO 3166-2:AF,Asia,Southern Asia,142,034\nÅland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,150,154\nAlbania,AL,ALB,008,ISO 3166-2:AL,Europe,Southern Europe,150,039\nAlgeria,DZ,DZA,012,ISO 3166-2:DZ,Africa,Northern Africa,002,015\nAmerican Samoa,AS,ASM,016,ISO 3166-2:AS,Oceania,Polynesia,009,061\nAndorra,AD,AND,020,ISO 3166-2:AD,Europe,Southern Europe,150,039\nAngola,AO,AGO,024,ISO 3166-2:AO,Africa,Middle Africa,002,017\nAnguilla,AI,AIA,660,ISO 3166-2:AI,Americas,Caribbean,019,029\nAntarctica,AQ,ATA,010,ISO 3166-2:AQ,,,,\nAntigua and Barbuda,AG,ATG,028,ISO 3166-2:AG,Americas,Caribbean,019,029\nArgentina,AR,ARG,032,ISO 3166-2:AR,Americas,South America,019,005\nArmenia,AM,ARM,051,ISO 3166-2:AM,Asia,Western Asia,142,145\nAruba,AW,ABW,533,ISO 3166-2:AW,Americas,Caribbean,019,029\nAustralia,AU,AUS,036,ISO 3166-2:AU,Oceania,Australia and New Zealand,009,053\nAustria,AT,AUT,040,ISO '

Data Processing

In this section, I'll manipulate the data from the two sources so that it is usable later on. That means adding columns for the year and the month to the DataFrame made from the Kaggle data, integrating the region information into that main DataFrame, and calculating the averages of the data over each year and integrating those into the DataFrame as well. There will be more subtle data manipulation later on, but the main work happens here.

First, I'll import the package needed for managing the data. Pandas holds the data in an easy-to-manipulate DataFrame.

In [4]:
import pandas

The table with region info on each country has to be converted from a CSV to a pandas DataFrame. Some of the country names contain commas, which have to be handled specially since the columns are also separated by commas.

In [5]:
# this should have been a one-line list comprehension, but some countries have a comma in their
# names (wrapped in quotes rather than escaped), so that special case needs to be handled
regions = []
for row in r.text.splitlines():
    if not row:
        continue  # skip blank lines, such as one left by a trailing newline
    if '"' not in row:
        regions.append(row.split(','))
    else:
        # the quoted name was split in two at its comma; rejoin the halves and drop the quotes
        lst = row.split(',')
        name = (lst[0] + "," + lst[1]).strip('"')
        regions.append([name] + lst[2:])

regions = pandas.DataFrame(regions[1:], columns=regions[0])
regions.set_index('name', inplace=True)
regions.head()
Out[5]:
                alpha-2  alpha-3  country-code     iso_3166-2   region       sub-region  region-code  sub-region-code
name
Afghanistan          AF      AFG           004  ISO 3166-2:AF     Asia    Southern Asia          142              034
Åland Islands        AX      ALA           248  ISO 3166-2:AX   Europe  Northern Europe          150              154
Albania              AL      ALB           008  ISO 3166-2:AL   Europe  Southern Europe          150              039
Algeria              DZ      DZA           012  ISO 3166-2:DZ   Africa  Northern Africa          002              015
American Samoa       AS      ASM           016  ISO 3166-2:AS  Oceania        Polynesia          009              061
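
As an alternative to splitting rows by hand, pandas can parse the quoted names itself. The sketch below is only an illustration on a small inline sample (the `sample` string and the name `regions_sample` are made up for this example, with rows in roughly the same format as the hosted table), not the code used above. It assumes that `dtype=str` and `keep_default_na=False` are wanted, so that codes such as `004` keep their leading zeros and Namibia's code `NA` is not read as a missing value.

```python
import io

import pandas

# hypothetical sample rows, shaped like the hosted table but with fewer columns
sample = (
    "name,alpha-2,alpha-3,country-code,region,sub-region\n"
    "Afghanistan,AF,AFG,004,Asia,Southern Asia\n"
    '"Bonaire, Sint Eustatius and Saba",BQ,BES,535,Americas,Caribbean\n'
    "Namibia,NA,NAM,516,Africa,Southern Africa\n"
    "Antarctica,AQ,ATA,010,,\n"
)

regions_sample = pandas.read_csv(
    io.StringIO(sample),
    dtype=str,              # keep codes like '004' as strings
    keep_default_na=False,  # do not turn 'NA' or empty cells into NaN
    index_col="name",
)

print(regions_sample)
```

With the real table, the same call could probably be pointed at `io.StringIO(r.text)` instead of the sample string.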

Next, process the climate data by creating a pandas DataFrame and adding columns with year and month information, derived from the string date column 'dt'.

In [6]:
data = pandas.read_csv('GlobalLandTemperaturesByCountry.csv')

year = [int(dt[:4]) for dt in data['dt']]
month = [int(dt[5:7]) for dt in data['dt']]
data['year'] = year
data['month'] = month

# Drop rows without temperatures listed
data = data[data['AverageTemperature'].notnull()]

data.head()
Out[6]:
           dt  AverageTemperature  AverageTemperatureUncertainty  Country  year  month
0  1743-11-01               4.384                          2.294    Åland  1743     11
5  1744-04-01               1.530                          4.680    Åland  1744      4
6  1744-05-01               6.702                          1.789    Åland  1744      5
7  1744-06-01              11.609                          1.577    Åland  1744      6
8  1744-07-01              15.342                          1.410    Åland  1744      7
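
The year and month can also be derived by parsing the dates rather than slicing the strings. The following is a minimal sketch on a few made-up rows (the `sample` frame is hypothetical and only mimics the layout of the Kaggle file); it assumes every date is written as `YYYY-MM-DD`.

```python
import pandas

# hypothetical rows in the same layout as the Kaggle file
sample = pandas.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-04-01"],
    "AverageTemperature": [4.384, None, 1.530],
})

# parse the strings once, then read the parts off the datetime accessor
parsed = pandas.to_datetime(sample["dt"], format="%Y-%m-%d")
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month

# drop (rather than fill in) rows that have no temperature
sample = sample.dropna(subset=["AverageTemperature"])

print(sample)
```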

Calculate the yearly averages for each country and add them to the DataFrame.

In [7]:
# numeric_only=True keeps the string 'dt' column out of the mean
intermediate = data.groupby(['Country', 'year']).mean(numeric_only=True).reset_index()
intermediate.rename(columns={'AverageTemperature': 'YearlyAverageTemperature',
                             'AverageTemperatureUncertainty': 'YearlyAverageTemperatureUncertainty'}, inplace=True)
intermediate = intermediate.drop('month', axis=1)

data = pandas.merge(data, intermediate, how='right', on=['Country', 'year'])
data.head()
Out[7]:
           dt  AverageTemperature  AverageTemperatureUncertainty  Country  year  month  YearlyAverageTemperature  YearlyAverageTemperatureUncertainty
0  1743-11-01               4.384                          2.294    Åland  1743     11                    4.3840                             2.294000
1  1744-04-01               1.530                          4.680    Åland  1744      4                    6.6985                             1.987625
2  1744-05-01               6.702                          1.789    Åland  1744      5                    6.6985                             1.987625
3  1744-06-01              11.609                          1.577    Åland  1744      6                    6.6985                             1.987625
4  1744-07-01              15.342                          1.410    Åland  1744      7                    6.6985                             1.987625
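
The group-then-merge step above can also be written as a single `groupby(...).transform(...)`, which returns one value per original row. This is a hedged sketch on made-up numbers (the `sample` frame and its countries "A" and "B" are hypothetical), shown only to illustrate the idea.

```python
import pandas

# hypothetical monthly readings for two countries
sample = pandas.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "year": [2000, 2000, 2001, 2000, 2000],
    "AverageTemperature": [1.0, 3.0, 5.0, 10.0, 20.0],
})

# transform('mean') broadcasts each group's mean back onto its rows,
# so no intermediate DataFrame or merge is needed
sample["YearlyAverageTemperature"] = (
    sample.groupby(["Country", "year"])["AverageTemperature"].transform("mean")
)

print(sample)
```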

Add the region data (only the region and sub-region columns) into the main DataFrame.

In [8]:
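intermediate = []
for country in data['Country'].unique():
    # prefer an exact name match; otherwise fall back to the first region name that contains
    # the country name (e.g. 'Åland' in 'Åland Islands'), so 'Niger' is not also matched to 'Nigeria'
    if country in regions.index:
        matches = [country]
    else:
        matches = [name for name in regions.index if country in name][:1]
    for name in matches:
        intermediate.append([country, regions.loc[name, 'region'], regions.loc[name, 'sub-region']])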

df = pandas.DataFrame(intermediate, columns=['Country','region','sub_region'])
data = pandas.merge(data, df, how='right', on=['Country'])
data.head()
Out[8]:
           dt  AverageTemperature  AverageTemperatureUncertainty  Country  year  month  YearlyAverageTemperature  YearlyAverageTemperatureUncertainty  region       sub_region
0  1743-11-01               4.384                          2.294    Åland  1743     11                    4.3840                             2.294000  Europe  Northern Europe
1  1744-04-01               1.530                          4.680    Åland  1744      4                    6.6985                             1.987625  Europe  Northern Europe
2  1744-05-01               6.702                          1.789    Åland  1744      5                    6.6985                             1.987625  Europe  Northern Europe
3  1744-06-01              11.609                          1.577    Åland  1744      6                    6.6985                             1.987625  Europe  Northern Europe
4  1744-07-01              15.342                          1.410    Åland  1744      7                    6.6985                             1.987625  Europe  Northern Europe
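
Because the country names in the two sources do not always agree, it can help to check which names fail to match exactly before merging, and to see why a plain substring test can over-match. The sketch below uses a handful of made-up rows (`climate_names` and `region_names` are hypothetical stand-ins for the two tables) and the `indicator` option of `pandas.merge`.

```python
import pandas

# hypothetical country names from the two sources
climate_names = pandas.DataFrame({"Country": ["Niger", "Nigeria", "Åland", "Bolivia"]})
region_names = pandas.DataFrame({
    "Country": ["Niger", "Nigeria", "Åland Islands", "Bolivia (Plurinational State of)"],
    "region": ["Africa", "Africa", "Europe", "Americas"],
})

# indicator=True adds a '_merge' column saying where each row was found
checked = pandas.merge(climate_names, region_names, how="left", on="Country", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# a substring test finds 'Niger' inside 'Nigeria' as well
substring_hits = [name for name in region_names["Country"] if "Niger" in name]
print(substring_hits)
```

Names that show up in `unmatched` are the ones that need a looser rule or a manual mapping.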

Exploratory Analysis & Data Visualization

Next, I'll plot some of the data that has been collected, processed, and compiled into a single DataFrame. There will be plots for the United States, the world divided by region, and each continent divided by sub-region.

In [9]:
import matplotlib.pyplot as plt
from ggplot import *
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # for fewer visual distractions

Below, the plot for the United States clearly shows a gradual rise in temperature.

In [8]:
# filter to the United States first, then keep one row per year
us_data = data.loc[data['Country'] == "United States",
                   ['Country','year','YearlyAverageTemperature','YearlyAverageTemperatureUncertainty']]\
    .drop_duplicates()

ggplot(aes(x='year', y='YearlyAverageTemperature'), data=us_data) +\
    geom_point() +\
    ggtitle("United States Average Temperature over time") +\
    xlab('Year') +\
    ylab('Average Temperature (Celsius)') +\
    stat_smooth(span=0.2)
Out[8]:
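[Plot: United States Average Temperature over time; yearly average temperature (Celsius) by year, with a smoothed trend line]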
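
To put a rough number on the rise seen in the plot, one option is a straight-line least-squares fit. The helper below, `trend_per_century`, is a hypothetical name introduced for this sketch, and it is demonstrated on synthetic data rather than on the real temperatures. A straight line is a simplifying assumption; the smoothed curve above suggests the real trend is not perfectly linear.

```python
import numpy as np


def trend_per_century(years, temps):
    """Least-squares slope of temps against years, in degrees per 100 years."""
    slope, _intercept = np.polyfit(
        np.asarray(years, dtype=float), np.asarray(temps, dtype=float), deg=1
    )
    return slope * 100


# synthetic series that warms by exactly 0.01 degrees per year
years = np.arange(1900, 2001)
temps = 8.0 + 0.01 * (years - 1900)

trend = trend_per_century(years, temps)
print(round(trend, 3))
```

With the tutorial's data, something like `trend_per_century(us_data['year'], us_data['YearlyAverageTemperature'])` should give the equivalent figure for the United States.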

Next, we'll look at global data, and see if the world regions are useful ways of grouping climates.

In [9]:
global_data = data[['Country','year','YearlyAverageTemperature','YearlyAverageTemperatureUncertainty',\
                    'region','sub_region']].drop_duplicates()

for r in data['region'].unique():
    g = ggplot(aes(x='year', y='YearlyAverageTemperature', color='sub_region'), \
               data=global_data[global_data['region']==r]) +\
        geom_point() +\
        ggtitle("%s Average Temperature over time"%(r)) +\
        xlab('Year') +\
        ylab('Average Temperature (Celsius)') +\
        stat_smooth(colour='blue', span=0.5)
    print(g)
[Six plots, one per region: "<region> Average Temperature over time"; yearly average temperature (Celsius) by year, colored by sub-region]

It turns out the temperatures don't cluster well by region. Since climate and global temperature patterns are quite complex, it would be interesting to explore further and see what would be a good way to group temperatures by location, and I'll be doing exactly that in the next section.
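
One rough way to put a number on how well a grouping clusters temperatures is to look at how much the countries inside each group differ from one another. The sketch below is an illustration only: `within_group_spread` is a hypothetical helper, the sample values are made up, and the standard deviation of per-country means is just one of several measures that could be used.

```python
import pandas


def within_group_spread(frame, group_col, value_col="YearlyAverageTemperature"):
    """Standard deviation of per-country mean temperature inside each group."""
    country_means = frame.groupby([group_col, "Country"])[value_col].mean()
    return country_means.groupby(level=group_col).std()


# hypothetical data: 'North' mixes a cold and a hot country, 'South' is uniform
sample = pandas.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "region": ["North", "North", "North", "North", "South", "South", "South", "South"],
    "YearlyAverageTemperature": [1.0, 3.0, 21.0, 23.0, 10.0, 12.0, 11.0, 13.0],
})

spread = within_group_spread(sample, "region")
print(spread)
```

A large spread means the group lumps together countries with very different temperatures. With the tutorial's data, `within_group_spread(global_data, 'region')` should give the per-region figures.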

Machine Learning

The goal in this section is to find a new set of regions that better predicts the temperature trends of the countries within each region. Some large countries, like the United States, Russia, and Australia, would probably benefit from being divided further into smaller regions to analyze, but for simplicity I'll consider a country to be the smallest geographic unit.

Some of the code was adapted from the examples available through scikit-learn$_{4,5}$. The scikit-learn documentation is a great resource for learning about K-Means and other machine learning approaches. There is a helpful tutorial that goes in depth on K-Means, which I have linked to below$_6$.

First, import the additional packages which will be necessary. I'll be using scikit-learn, as previously mentioned, along with scipy for some additional work. I'll also be using numpy here, which was imported earlier.

4 - http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html

5 - http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

6 - https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials

In [10]:
from sklearn.cluster import KMeans
import scipy.stats  # import the stats submodule explicitly so scipy.stats.zscore is available

Next, prepare the data as a numpy array. Use scikit-learn's K-Means tools to find 15 clusters. I chose this number after trying out different numbers and finding that this one gave relatively good results. The plot shows the clusters grouped by color.

In [39]:
X = np.array(data[['AverageTemperature','month']])
# normalize the data
X = scipy.stats.zscore(X)
# multiply Temperature col by 100 so it gets weighted more heavily
X[:,0] *= 100

est = KMeans(n_clusters=15)
est.fit(X)
labels = est.labels_

# color each point by the cluster it was assigned to
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels.astype(float))
ax.set_xlabel('Temp Z-score*100')
ax.set_ylabel('Month Z-score')

plt.show()
[Scatter plot: Temp Z-score*100 (x-axis) against Month Z-score (y-axis), points colored by cluster]
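
Trying out different numbers of clusters can be made a little more systematic by comparing the fitted model's `inertia_` (the sum of squared distances from each point to its cluster center) across several values of k, often called the elbow method. The sketch below runs on synthetic blobs, not on the climate data, and the names `X_demo` and `inertias` are made up for the example; it is not how the value 15 was chosen above.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# fit K-Means for several k and record how tight the clusters are
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The k at which the inertia stops dropping sharply is a reasonable candidate. On real data the bend is usually much less clear than on these blobs, so some judgment is still needed.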

Now that there are new region labels, add those to the main DataFrame so that the data can be plotted by region. For each region, show it overall with a smoothed conditional mean in red on top, and then show that same data sorted by country. Notice the countries in each region.

In [54]:
data['new_region'] = est.labels_

global_data = data[['Country','year','YearlyAverageTemperature','YearlyAverageTemperatureUncertainty',\
                    'new_region']].drop_duplicates()

for r in data['new_region'].unique():
    g = ggplot(aes(x='year', y='YearlyAverageTemperature'), \
               data=global_data[global_data['new_region']==r]) +\
        geom_point() +\
        ggtitle("Region #%s Average Temperature over time"%(r)) +\
        xlab('Year') +\
        ylab('Average Temperature (Celsius)') +\
        stat_smooth(color='red', span=0.5)
    print(g)
    g = ggplot(aes(x='year', y='YearlyAverageTemperature',color='Country'), \
               data=global_data[global_data['new_region']==r]) +\
        geom_point() +\
        ggtitle("Region #%s Average Temperature over time"%(r)) +\
        xlab('Year') +\
        ylab('Average Temperature (Celsius)') 
    print(g)
[30 plots, two per new region: "Region #<n> Average Temperature over time" with a smoothed conditional mean in red, followed by the same data colored by country]
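
The countries in each new region can also be listed directly instead of being read off the plot legends. Since the clustering above is run on individual monthly rows, a single country can end up in more than one region, so the sketch below also picks the most frequent label per country. The `sample` frame and the names `countries_by_region` and `dominant_region` are hypothetical, used only for illustration.

```python
import pandas

# hypothetical cluster labels for the monthly rows of three countries
sample = pandas.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "new_region": [0, 0, 1, 1, 1, 0],
})

# every country that appears at least once in each region
countries_by_region = sample.groupby("new_region")["Country"].unique()
print(countries_by_region)

# the single most frequent region label for each country
dominant_region = sample.groupby("Country")["new_region"].agg(
    lambda labels: labels.value_counts().idxmax()
)
print(dominant_region)
```

With the tutorial's data, the same two group-bys could be applied to `data[['Country', 'new_region']]`.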

Insight & Policy Decision

The groupings produced by K-means are still not perfect regional breakdowns, but they do provide a different picture, in which regions are determined by climate and season rather than by arbitrary geographic lines. Within each region, you can see a different slope as the average temperature changes over time, because different regions are affected differently. A couple have even gotten cooler over time. It's called "Global Climate Change," and not "Global Warming," because, as is evident here, not every region is warming up, and certainly not all at the same rate. The global climate is shifting and changing, affecting different areas differently. For more information, consider reading about the Earth's Climate System$_7$ and even more evidence of climate change from NASA$_8$.

With this knowledge about global climate change, we can advocate for further research into how different regions are affected, and into what efforts in each region would best mitigate the effects that have already been set in motion.

7 - https://www.ipcc.ch/ipccreports/tar/wg1/pdf/TAR-01.PDF

8 - https://climate.nasa.gov/evidence/